Length Normalization in Degraded Text Collections

نویسندگان

  • Amit Singhal
  • Gerard Salton
  • Chris Buckley
چکیده

Optical character recognition (OCR) is the most commonly used technique to convert printed material into electronic form. Using OCR, large repositories of machine readable text can be created in a short time. An information retrieval system can then be used to search through large information bases thus created. Many information retrieval systems use sophisticated term weighting functions to improve the eeectiveness of a search. Term weighting schemes can be highly sensitive to the errors in the input text, introduced by the OCR process. This study examines the eeects of the well known cosine normalization method in the presence of OCR errors, and proposes a new, more robust, normalization method. Experiments show that the new scheme is less sensitive to OCR errors and facilitates use of more diverse basic weighting schemes. When used in a correct text collection, the new normalization scheme yields signiicant improvements in retrieval eeectiveness over cosine normalization. 1 Background OCR is widely used to scan printed text, text not otherwise available in machine readable form, and convert it into electronic form. Advances in OCR technology have decreased the error rates of OCR devices, but due to the poor quality of some printed material and other deteriorations introduced by the OCR system, the error rate of an OCR process can still be substantial..6] This poses a special challenge when scanned text collections are searched by an automatic information retrieval

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Revisiting N-Gram Based Models for Retrieval in Degraded Large Collections

The traditional retrieval models based on term matching are not effective in collections of degraded documents (output of OCR or ASR systems for instance). This paper presents a n-gram based distributed model for retrieval on degraded text large collections. Evaluation was carried out with both the TREC Confusion Track and Legal Track collections showing that the presented approach outperforms ...

متن کامل

Effects of OCR Errors on Ranking and Feedback Using the Vector Space Model

We report on the performance of the vector space model in the presence of OCR errors. We show that average precision and recall is not affected for our full text document collection when the OCR version is compared to its corresponding corrected set. We do see divergence though between the relevant document rankings of the OCR and corrected collections with different weighting combinations. In ...

متن کامل

DeepNNNER: Applying BLSTM-CNNs and Extended Lexicons to Named Entity Recognition in Tweets

In this paper, we describe the DeepNNNER entry to The 2nd Workshop on Noisy User-generated Text (WNUT) Shared Task #2: Named Entity Recognition in Twitter. Our shared task submission adopts the bidirectional LSTM-CNN model of Chiu and Nichols (2016), as it has been shown to perform well on both newswire and Web texts. It uses word embeddings trained on large-scale Web text collections together ...

متن کامل

Unsupervised Gene/Protein Named Entity Normalization Using Automatically Extracted Dictionaries

Gene and protein named-entity recognition (NER) and normalization is often treated as a two-step process. While the first step, NER, has received considerable attention over the last few years, normalization has received much less attention. We have built a dictionary based gene and protein NER and normalization system that requires no supervised training and no human intervention to build the ...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 1995